What Study Design Should I Choose to Measure Program Effects?

Researchers who plan to study the effectiveness of a policy, program, or practice should choose a 
feasible study design that maximizes scientific rigor for the context and fits within cost and 
operational constraints. 

Randomized controlled trials (RCTs), or “experiments,” provide strong evidence of
effectiveness. Random assignment ensures that treatment and control groups do not differ
systematically except for receipt of the intervention. In a well-designed and well-implemented
RCT, researchers can be more confident than when using other research designs that they are
measuring program effects and not effects of something else.

Implementing an RCT is not always feasible. For example, providers may be unwilling or unable 
to limit participation in a program to some students when the program being studied has more 
seats available than applicants. Also, funders or policymakers may decide to begin a study after a 
program is under way. A late start to a study makes random assignment infeasible except in 
special circumstances. 1

When conducting an RCT is not possible, a strong quasi-experimental design study (QED), or
quasi-experiment, can provide valuable evidence about a program’s effectiveness. This brief
discusses best practices and objectives in designing and implementing strong QEDs and presents
answers to frequently asked questions from developers and researchers who want their studies to
meet the U.S. Department of Education’s standards of rigor, as defined by the What Works
Clearinghouse (WWC). The brief also summarizes common pitfalls that cause QEDs not to meet WWC
standards for group design studies. Group design studies


How key terms are used in this brief 

Treatment (intervention): the policy, program, practice, or 
strategy that will be evaluated. 

Treatment group: the group of individuals, classrooms, 
schools, districts, or institutions participating in the study 
and the intervention. (Studies sometimes call this the
participant group or the intervention group.)

Comparison group: the group of individuals, classrooms, 
schools, districts, or institutions participating in the study 
but not participating in the intervention. Although the
terms are often used interchangeably in studies and
reports, this brief uses “comparison group” for QED
studies and “control group” for RCT studies.

Strong QED: a quasi-experimental design study that meets 
standards for credible evidence of effectiveness. In this 
brief, we focus primarily on the What Works Clearinghouse 
evidence standards. 


Researchers and developers 
should always consider first 
whether it is feasible to implement 
a randomized controlled trial to 
examine program effectiveness. 


1 For example, some charter schools use lotteries to allocate their open spaces, and it is possible to use the lotteries 
after the fact to set up an experiment. Other than these special circumstances, nearly all experiments are planned 
from the start. See Resch, Berk and Akers (2014) for guidance on recognizing and conducting opportunistic 
experiments in education field settings. 
include all studies that use a comparison group to assess the effectiveness of a treatment or
intervention. 2 


Best Practices in Designing and Implementing a QED 

A key principle in designing quasi-experiments is that the more the quasi-experiment is like a 
true experiment, the stronger its validity (Rosenbaum 2010). In an experiment, random 
assignment creates both (1) a treatment group and (2) a control group that is the treatment 
group’s mirror image. A QED will be stronger if its comparison group is as close as it can be to a 
mirror image of the treatment group. Because a QED is not using random assignment, it should 
use other approaches to create the mirror-image comparison group. 

Exhibit 1. In a strong QED, the comparison group will be close to a mirror image of the 
treatment group. 

[Figure: two panels labeled “Treatment Group” and “Comparison Group.”]


The analogy of a mirror image is a useful way to think about an ideal quasi-experiment. In 
practice, a comparison group for any QED is not a perfect mirror image of its treatment group. 
Even if what can be seen or measured (for example, gender, race/ethnicity, achievement, or 
previous experience) about the groups is exactly equivalent, there is no presumption that what 
cannot be seen will be equivalent, as there is in an experiment. And, crucially, it is not possible 
to test whether these characteristics differ, for the simple reason that they cannot be measured. 

This brief describes four key best practices for creating the best possible mirror-image
comparison group and conducting strong QED studies. Following the discussion of each of these 
best practices, we present frequently asked questions that address specific issues in conducting 
QED studies. 

Best Practice 1: Consider Unobserved Variables — Related or Unrelated to 
Outcomes 

Sometimes “unobserved” differences between a treatment and comparison group are unrelated to 
the outcomes on which researchers are measuring impacts. When this is the case, these unrelated 
unobserved differences would not have any influence on the magnitude of a study’s impact 


2 The WWC standards that are relevant for RCTs or QEDs and applied here are called “group design” standards. 
Group design studies measure impacts by comparing outcomes for a treated set of individuals versus a comparison 
set of individuals, not by comparing outcomes for the same individuals over time. This brief focuses on aspects of 
the WWC group design standards specifically related to QEDs, referred to simply as “the WWC standards.” For a 
comprehensive discussion of all WWC evidence standards, including group design standards, consult the What 
Works Clearinghouse Procedures and Standards Handbook, Version 3.0.
estimates. For example, researchers evaluating a postsecondary developmental math program do
not need to worry that more students in the treatment group than in the comparison group like 
peanut butter and jelly sandwiches. Taste in sandwiches is not correlated with math skills. 

Unobserved variables that are related to outcomes, however, will lead to biased impact estimates. 
Suppose that a school offers a voluntary summer catch-up program for children who are reading 
below grade expectations. Parents who enroll their children in this program may be different in 
important, hard-to-observe ways from those who do not enroll their children. For example, they 
may read to their children more often than parents whose children did not enroll. That difference 
by itself could create a difference in children’s motivation to read, even for students who are the 
same on other dimensions such as age, gender, family income, and spring test scores. If more 
motivated students with greater parent support enroll in the catch-up program, their reading skills 
may improve faster over the summer than those of nonenrolled students, independently of any 
effect the catch-up program might have had (Exhibit 2). 

Exhibit 2. The comparison group may look the same as the treatment group but may differ 
in ways that researchers cannot observe or measure (like motivation), making it hard to 
argue that differences in outcomes are due solely to the treatment. 


[Figure: the treatment group is labeled “More motivated” and the comparison group is labeled
“Less motivated.”]


Exhibit 2 above illustrates a key limitation facing any quasi-experiment. When a study estimates 
program effects without measuring characteristics that are related to outcomes, part of the 
measured effect may arise because of differences in, say, motivation to read. The study cannot 
say how much of the measured effect is a real program effect and how much is due to differences 
in unmeasured characteristics. And no matter how much effort the study invests in gathering data 
to increase the number of measured characteristics (such as by surveying parents about their 
literacy practices when their child was younger), something relevant will always be unobserved 
and unmeasured. 
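The catch-up program example can be made concrete with a small simulation. The sketch below is only an illustration under assumed numbers: the share of motivated students, the enrollment probabilities, and the size of the motivation effect are all hypothetical and are not drawn from any study. It sets the true program effect to zero and shows that a naive comparison of enrolled and nonenrolled students still produces a positive gap, because motivation affects both enrollment and outcomes.

```python
import random
from statistics import mean


def simulate_naive_estimate(n=20000, true_effect=0.0, seed=0):
    """Simulate a voluntary program in which unobserved motivation drives both
    enrollment and outcomes; return the naive treatment-comparison gap."""
    rng = random.Random(seed)
    treated, comparison = [], []
    for _ in range(n):
        motivated = rng.random() < 0.5        # never observed by the researcher
        p_enroll = 0.7 if motivated else 0.3  # motivated families enroll more often
        enrolled = rng.random() < p_enroll
        # Reading gain in standard deviation units (hypothetical scale).
        gain = 1.0 * motivated + true_effect * enrolled + rng.gauss(0.0, 1.0)
        (treated if enrolled else comparison).append(gain)
    return mean(treated) - mean(comparison)


naive = simulate_naive_estimate()
print(f"True program effect: 0.00; naive estimate: {naive:.2f}")
```

Because the simulated researcher never sees the motivation variable, no adjustment for measured characteristics can remove this gap. This is the limitation that Exhibit 2 illustrates.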


Researchers can take steps at the design, analysis, and interpretation phases of a study to
reduce the chance that alternative explanations are driving the study’s results. At the design
phase, researchers should consider whether the groups considered for comparison are drawn
from very different populations or settings. Also at the design phase, as discussed in “Best
Practice 2,” researchers can apply matching techniques to minimize differences between
treatment and comparison groups in characteristics that can be measured and are most likely to
be related to the outcomes of interest. These measured characteristics can be used to create
matched samples and can also be included as control variables during the analysis phase.
Measuring as many characteristics related to outcomes as possible helps to minimize potential
bias but cannot eliminate it. For this reason, thinking about the unmeasured factors that may
remain helps researchers interpret the results of impact analyses. Considering unmeasured
factors is especially important when results run counter to prior evidence or to the
researchers’ expectations.

Best Practice 2: Select Appropriate Matching Strategies for Creating 
Comparison Groups 

One way to make sure that the comparison group is as nearly equivalent to the treatment group 
as possible is to use a high-quality matching strategy to form the comparison group. Strategies
for creating comparison groups range from conveniently identifying a group that is “like” the 
treatment group to carefully selecting a group by using matching techniques. For example, a 
convenient approach might be to select students in neighboring schools who are not 
implementing a particular program. These students could be used as a comparison group for a 
treatment group in schools that are using the program. This approach is inexpensive and 
straightforward to implement, but it risks creating groups that are not equivalent on important 
characteristics, which would be evident only after data are collected. 

A more complex matching strategy is likely to produce a more nearly equivalent comparison 
group. Careful matching strategies that identify comparison group individuals who satisfy some 
metric of equivalence or closeness to treatment group individuals (“individual” could refer to 
students, teachers, schools, districts, postsecondary institutions, or higher education systems) are 
better options for forming a comparison group than relying on a convenience sample. 

Rosenbaum (2010) surveys the vast literature, discusses high-quality approaches to matching, 
and provides useful examples. 
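This brief does not prescribe a particular matching algorithm. As one simple illustration of matching on a “metric of closeness,” the sketch below pairs each treatment group member with the nearest unused comparison candidate on a single baseline score. The function name, the caliper parameter, and the data are hypothetical, and real studies typically match on several characteristics at once.

```python
def nearest_neighbor_match(treated, pool, caliper=None):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, pool: dicts mapping an ID to a baseline score.
    caliper: optional maximum allowed distance for a match.
    Returns a list of (treated_id, comparison_id) pairs.
    """
    available = dict(pool)  # copy so the caller's pool is not modified
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining comparison candidate on the baseline score.
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if caliper is not None and abs(available[c_id] - t_score) > caliper:
            continue  # no acceptable match; this treated unit is left unmatched
        pairs.append((t_id, c_id))
        del available[c_id]
    return pairs


treated = {"T1": 48.0, "T2": 55.0, "T3": 61.0}
pool = {"C1": 40.0, "C2": 47.0, "C3": 54.0, "C4": 60.0, "C5": 75.0}
pairs = nearest_neighbor_match(treated, pool, caliper=5.0)
print(pairs)
```

Greedy matching is easy to follow but depends on the order in which treated units are processed. Whatever matching approach is used, the study must still demonstrate baseline equivalence for its analytic sample.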

Best Practice 3: Follow General Guidelines for Sound Research 

QED studies, like other evaluation studies, will provide the most reliable answers to research 
questions when they follow broader guidelines for sound research. In sound research, 

• Studies should specify clear research questions up front. 

• Studies should determine sample design and data collection approaches to answer the 
research questions. The sample design should specify clear eligibility criteria, methods for 
forming the research sample, and sample sizes necessary to detect meaningful impacts of the 
intervention on key outcomes. The data collection plan should identify valid, reliable 
outcome measures needed to answer the research questions. 

• Plans for analysis should reflect the research design and sample selection procedures. 

To produce the strongest evidence of a treatment’s effectiveness, both the intervention and the 
research plans must be well implemented. Researchers who conduct effectiveness research in 
field settings such as classrooms and schools often encounter challenges and hurdles to 
implementing the intervention or analyzing its effects. But starting with a clear plan and being 
flexible will yield stronger research than starting with a vague plan and hoping for the best. In 
field settings, researchers should expect that unplanned or unforeseen events will hamper study
designs rather than strengthen them. 

Appendix A presents (1) a checklist of issues to consider when designing strong QEDs and (2) a 
supporting table that provides a comprehensive discussion of the checklist. The table defines 
each key design issue, explains the extent to which the WWC considers the issue in determining 
a study rating, and documents general considerations for good practice. 

Best Practice 4: Address the What Works Clearinghouse (WWC) Standards 

In addition to the general guidelines for sound research discussed in “Best Practice 3,” the WWC
standards are a good source of information about best practices for conducting evaluations.
These standards were identified from the research literature and in consultation with
methodological experts. For quasi-experiments, the WWC does not examine the matching
approach used by the researchers. The WWC does, however, review studies against several
standards, some of which pertain to both experiments and quasi-experiments. For QEDs to meet
WWC standards with reservations (the highest possible rating for QEDs), they must

• Compare two distinct groups, not the same 
group before and after a treatment. 

• Use appropriate outcomes 3 that are 

■ Valid: that is, the study measures what it 
says it measures. 

■ Reliable: that is, the outcome is measured consistently and accurately. 

■ Measured in the same way (using the same instrument and at the same time) for the 
treatment and comparison groups. 

■ Not too closely aligned with the treatment. “Overalignment” — such as when a reading 
test includes questions that relate only to the reading program being studied — leads to 
inaccurate measurement of program effects because one group will naturally score 
differently than the other. 

• Demonstrate baseline (pre-intervention) equivalence of the treatment and comparison 
groups on key characteristics specified in the relevant WWC review protocol. 4 Studies must 
demonstrate pre-intervention equivalence for their analytic sample (in other words, the 


Because a strong QED study 
cannot rule out potential bias in 
the impact estimates, the highest 
rating it can receive from the 
WWC is “meets WWC group 
design standards with 
reservations.” 


3 If a study does not have any acceptable outcomes, then it could not meet WWC standards. If at least one outcome 
is acceptable, a QED study is eligible to meet WWC standards with reservations. 

4 These characteristics are determined by researchers with expertise in the topic area and are specified in a protocol 
used to guide the WWC reviews of studies in each subject area. The protocols are a useful starting point for 
researchers designing a study that they want to meet WWC standards with reservations because the protocols 
specify a number of study parameters, including eligible populations (for example, age-range of study participants), 
interventions, outcomes, and baseline data that would need to be collected and analyzed to demonstrate equivalence 
of the analytic sample. The protocols can be found at the WWC website: 

http://ies.ed.gov/ncee/wwc/Publications_Reviews.aspx?f=All%20Publication%20and%20Product%20Types,5;#pubsearch.
sample used to measure program impacts). To be considered equivalent under the WWC standards,
the treatment and comparison groups must differ by no more than 0.25 standard deviations on each
of these characteristics before participation in the intervention. If a difference is between
0.05 and 0.25 standard deviations, impact estimates must statistically adjust for it by using
methods such as regression analysis or analysis of covariance. If all differences are smaller
than 0.05 standard deviations, then effects can be measured as simple differences of means.

• Be free of confounding factors. A confounding factor is one that affects outcomes of one 
group but not the other and is not part of the treatment being studied. When confounds arise, 
it is impossible to know whether measured effects are from the treatment, from the 
confounding factor, or from some combination. Examples of common confounds include 
time (for example, treatment and comparison groups come from different school years) and 
school or classroom (when either the treatment or comparison group comes from a single 
school or classroom), but there are others. 5 


5 When outcomes for the treatment group come from one school year and outcomes for the comparison group come 
from a different school year, the measured effect is confounded with differences that may arise between the different 
school years because of, for example, year-to-year changes in leadership and staffing, other programs that were 
implemented or taken away from one year to the next, or external issues that may have affected outcomes in one 
year but not the other (for example, major weather-related interruptions). A second common example is a case in 
which a treatment is implemented in only one classroom. In this situation, the measured effect is confounded with 
the effects of the classroom teacher. A third example is a study in which all treatment schools are in one district and 
all comparison schools are in another school district. In this case, the measured effect confounds the treatment 
effects and differences between the two districts. A fourth example is a case in which a treatment is always delivered 
in combination with another treatment — for example, when a first-year college transition course is offered at a 
student support center where there is easy access to tutors and mentors. In this case, the study is measuring the 
combined effect of both the transition course and the enhanced access to supports. If the effects of only one of the 
treatments is being examined in a review, the confound means that the study does not meet WWC standards for 
measuring the impact of that specific treatment. 
Frequently Asked Questions about Meeting WWC Standards

The following sections present frequently asked questions (FAQs) about meeting WWC 
standards that are most relevant to QED studies in key topic areas (study design and group 
formation, outcomes, confounding factors, baseline equivalence, sample loss and power, and 
analytic techniques). General FAQs about the WWC are available on its website. Also, we encourage
you to browse the resources section of the WWC website, which provides useful documents and 
links to WWC webinars and databases. Notably, the What Works Clearinghouse Procedures and 
Standards Handbook, Version 3.0 provides the most detailed discussion of the WWC evidence 
standards. 


Study Design/Group Formation 


Q1: What kinds of comparison group designs are eligible to “meet WWC group design standards
without reservations”? 

Answer: The only studies that are eligible to receive a rating of “meets WWC group design 
standards without reservations” are well-implemented randomized controlled trials (RCTs). A 
WWC webinar on July 21, 2014 provides extensive advice on implementing a strong RCT study. 
Quasi-experimental designs are not eligible for this rating. 


Q2: What kinds of nonrandomized comparison group designs are eligible to “meet WWC group 
design standards with reservations”? 

Answer: The WWC does not have any requirements about how treatment and comparison 
groups are constructed. All QEDs in which there are distinct treatment and comparison groups 
can be reviewed against the WWC standards for group comparison designs, and the highest 
rating that those studies can achieve is “meets WWC group design standards with reservations.” 
A WWC webinar on March 3, 2015 provides extensive advice on implementing a strong QED 
study. 


Q3: Can a QED study with a comparison group composed of a convenience sample of 
individuals who did not volunteer for the intervention meet WWC standards? 

Answer: Yes, if the study demonstrates baseline equivalence of the variables required in the 
relevant WWC topic area protocol, it can meet WWC standards with reservations. The WWC 
website publishes all IES-approved protocols.


Q4: Can a study that only measures one sample before and after a treatment meet WWC 
standards? 

Answer: No. To be eligible to be reviewed under WWC group design standards, there must be 
two distinct groups: a treatment group that receives a treatment and a comparison group that 
does not. Studies of groups that serve as their own controls can only be reviewed by the WWC 
under very specific circumstances and only under WWC pilot single-case design standards. For 
more information on these pilot standards, consult Appendix E of the What Works Clearinghouse
Procedures and Standards Handbook, Version 3.0.


Q5: Can a QED study that compares an earlier cohort to a different later cohort “meet WWC 
group design standards with reservations”? 

Answer: No. Comparing two different cohorts introduces a confound that makes it impossible to 
determine whether the program being tested is responsible for differences in outcomes between a 
treatment and comparison group or whether some other competing explanation accounts for 
differences in outcomes. In most schools, classrooms, and programs, a variety of things change 
from year to year other than the intervention being evaluated. When a historical cohort is used as 
a comparison group, it is not possible to assess whether other activities have affected outcomes. 
For example, suppose that a new program is implemented in a set of schools. Outcomes for 
students at the end of that year are then compared to outcomes of students enrolled in the same 
schools in the prior year. It is not possible to separate differences in outcomes between the earlier 
and later cohorts that are due to the treatment versus differences that are due to other changes 
between the two time periods. A new district-wide or institution-wide policy or some external 
force may have affected outcomes. 


Q6: Can a QED study that compares two different doses — for example, one year versus two 
years of exposure — of the same intervention be eligible to meet WWC standards with 
reservations? 

Answer: No. WWC reviews focus on measuring full program effects. A study measuring 
differences in dosage would generally not be eligible for review by the WWC because it would 
not be possible to determine whether the full intervention was having an effect. 


Q7: Can a QED meet WWC standards with reservations if the comparison group receives an 
alternate treatment? For example, could I examine how one new curriculum compares to another 
new curriculum, or how one strategy for delivering a developmental math course compares to a 
specific different strategy for delivering developmental math? 

Answer: Yes, but under some circumstances, the study may be excluded from a WWC review of 
overall intervention effectiveness. In the education research field, comparison groups often 
receive an alternate educational experience, whether it be “business as usual” (meaning that a 
school continues its status quo curriculum or set of supports) or a new, alternative curriculum or 
set of supports. When a study compares a new treatment to either a “no treatment” or a “business 
as usual” condition, it tries to answer the question, “what is the overall effect of this treatment in 
comparison to what students would have had if they had not had the opportunity to receive the 
new treatment?” When a study compares two treatments, then it tries to answer the question, 
“How much better (or worse) do students fare when they participate in one treatment versus the 
other?” 

For example, if a study is comparing the effects of a literacy curriculum combined with a new 
supplemental literacy course to the same curriculum with no additional supports, then it will be 
able to measure how students who were given additional supports fare compared to those who 
did not have that opportunity. The WWC reports the results of this kind of study (and other 
similar studies) in an intervention report 6 to help provide researchers, practitioners, and policy
makers with evidence of overall program effectiveness of the supplemental literacy course.
Conversely, if two supplemental literacy course curricula are compared “head to head” in a
study, then the study results will help provide information to practitioners and policy makers
who are trying to choose which supplemental course to implement. While this is useful
information for those who are selecting between two programs, the results will never provide any
evidence to support or refute the claim that a supplemental literacy course will improve outcomes.
This kind of “head to head” comparison can meet WWC standards but would be treated
differently in WWC reports. 7


Q8: Can a QED study that uses pre-existing data to identify a comparison group meet WWC 
standards? 

Answer: Yes, a QED study may use pre-existing data to identify a comparison group. Whether 
or not the baseline data already existed in administrative records or were collected for the QED 
study, the study must demonstrate that the pre-intervention characteristics of the groups are 
equivalent. 


Q9: Can a QED in which the treatment and comparison groups are clustered — for example, 
within classrooms or schools — meet WWC standards with reservations? 

Answer: Yes. Nonrandomized clustered QED studies can meet WWC standards with 
reservations. To do so, they must demonstrate equivalence at the cluster level if the analysis is 
measuring cluster-level effects. If a treatment is clustered at, for example, the school or 
classroom level, but the analysis makes inferences at the student level, then the study must 
establish equivalence at the student level to meet WWC standards with reservations. 


Outcomes 


Q10: What kinds of outcomes should be measured in order to meet WWC standards? 

Answer: Evaluations designed to meet WWC standards should always include at least one 
outcome that falls within the acceptable list of outcomes as defined by specific WWC review 
protocols. When a relevant WWC protocol is not available, researchers should
focus on outcomes related to achievement, progression through school, completion of education 
programs, degree attainment, and student behaviors. Outcomes such as attitudes or perceptions, 
while often important to measure as mediators, are not reviewed by the WWC and thus should 
not be the sole focus of a research study that hopes to meet WWC standards. 


6 Examples of intervention reports can be found on the WWC website at 

http://ies.ed.gov/ncee/wwc/Publications_Reviews.aspx?f=All%20Publication%20and%20Product%20Types,1;#pubsearch.

7 The WWC publishes four kinds of reports: intervention reports, practice guides, single study reviews, and quick 
reviews (see http://ies.ed.gov/ncee/wwc/publications_reviews.aspx).
Confounding Factors


Q11: Can a QED that uses existing data meet WWC standards with reservations if the data for
the comparison and treatment groups were not collected at the same time? 

Answer: No, if the data were not collected at the same time, then the study has a confounding 
factor (time) and cannot meet WWC standards. 


Q12: Can a study that compares only one treatment school with one or more comparison schools 
meet WWC standards with reservations? 

Answer: No. The WWC would consider this design to be confounded because there is only one 
unit (school) assigned to at least one of the treatment or comparison conditions. If there is only 
one unit, then some other characteristic (for example, teacher quality or alternative curricula 
available at that school) could explain differences in outcomes. For this reason, the WWC 
requires that both the treatment and comparison conditions contain at least two units to be eligible to
meet WWC standards with reservations. This is also true when the intervention is clustered at the 
district, teacher, classroom, or any other level. 
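The two-unit requirement described in this answer is easy to screen for before data collection begins. The sketch below is a hypothetical helper, not a WWC tool. It assumes that each record pairs a unit identifier (such as a school name) with a condition label of either "treatment" or "comparison," and it reports any condition that contains fewer than two distinct units.

```python
from collections import defaultdict


def conditions_with_single_unit(assignments):
    """Return the conditions that contain fewer than two distinct units.

    assignments: iterable of (unit_id, condition) pairs, where condition is
    "treatment" or "comparison" (labels assumed for this sketch).
    """
    units = defaultdict(set)
    for unit_id, condition in assignments:
        units[condition].add(unit_id)
    # A condition with zero or one unit is confounded with that unit.
    return [c for c in ("treatment", "comparison") if len(units[c]) < 2]


design = [
    ("School A", "treatment"),
    ("School B", "comparison"),
    ("School C", "comparison"),
]
flagged = conditions_with_single_unit(design)
print(flagged)
```

In this hypothetical design, the treatment condition is flagged because School A is its only unit. The same check applies when units are districts, teachers, or classrooms.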


Q13: Does the WWC consider a factor that is not related to the outcome to be a confounding 
factor? 

Answer: No. A confounding factor is a component of the study design or the circumstances 
under which the intervention was implemented that is perfectly aligned with either the treatment 
or comparison group. That is, a confounding factor is present for members of only one group and 
absent for all members in the other group. The confounding factor must present a reasonable 
alternative cause for the effects that are found. A factor unrelated to the outcome, such as eye 
color in an education study, would not be considered a confounding factor during a WWC 
review. 


Q14: When a treatment is a curriculum and the teacher must implement the curriculum, does the 
teacher need to teach both the treatment and comparison groups so that there are no concerns 
about the particular effects of that teacher to meet WWC standards with reservations? 

Answer: The WWC considers a QED study with one teacher who teaches multiple classrooms in 
each of the treatment and comparison conditions as eligible to meet WWC standards with 
reservations. However, teachers are not required to teach both the treatment and comparison 
groups as long as there are multiple teachers or classrooms in the study in both the treatment and 
comparison groups. In this latter case, there must be at least two teachers in the treatment group 
classrooms and two different teachers in the comparison group classrooms to be eligible to meet 
WWC standards with reservations. 
Baseline Equivalence

Q15: On what characteristics must QED studies demonstrate treatment and comparison group 
equivalence to meet WWC standards with reservations? 

Answer: In general, most WWC topic area protocols require that studies demonstrate 
equivalence on a baseline measure of the outcome. Many protocols also require or consider 
equivalence on other baseline characteristics (such as race-ethnicity and socioeconomic status) 
that topic area experts have identified as likely to be strongly associated with outcomes. The 
WWC posts all topic area protocols that describe requirements for baseline equivalence on its 
website. If a relevant WWC topic area protocol is not available, researchers should consider 
using the key baseline characteristics that are highly correlated with outcomes when 
demonstrating equivalence of the treatment and comparison samples. 


Q16: How closely equivalent must the pre-intervention characteristics of the treatment and 
comparison group be to meet WWC standards with reservations? Can a study simply statistically 
control for characteristics that have big differences between groups pre-intervention?

Answer: The WWC standards indicate that baseline equivalence is determined by calculating the 
effect size difference between the treatment and comparison groups for each of the required 
baseline characteristics. An effect size is calculated as the difference in means between the 
treatment and comparison groups, divided by the pooled (treatment and comparison group) 
standard deviation. If an effect size difference is greater than 0.25, then the comparison “does not 
meet WWC design standards.” If an effect size difference falls between 0.05 and 0.25, then the 
study must statistically control for this baseline measure in its impact analysis in order for the 
result to meet WWC standards with reservations. If the effect size difference is less than 0.05, 
then the study result is eligible to meet WWC standards with reservations, regardless of whether 
this measure is controlled for in the analysis. 8 The WWC does not consider whether differences 
in baseline measures are statistically significant when assessing whether groups are equivalent. 
Requirements for which characteristics must demonstrate equivalence vary by WWC topic area 
(see Q15 for more information about which characteristics must be equivalent at baseline). 
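
The calculation described in this answer can be sketched in a few lines of Python. This is an illustrative sketch only, not WWC software: the function names and the sample scores are hypothetical, the handling of values that fall exactly on 0.05 or 0.25 is an assumption, and the sketch follows the plain difference-in-means formula given above without any small-sample correction.

```python
from math import sqrt
from statistics import mean, variance


def baseline_effect_size(treatment, comparison):
    """Difference in group means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(comparison)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_t - 1) * variance(treatment)
                  + (n_c - 1) * variance(comparison)) / (n_t + n_c - 2)
    return (mean(treatment) - mean(comparison)) / sqrt(pooled_var)


def equivalence_category(effect_size):
    """Classify a baseline difference using the thresholds described in the text."""
    size = abs(effect_size)
    if size > 0.25:
        return "does not meet standards"
    if size > 0.05:  # assumption: exactly 0.05 is treated as equivalent
        return "requires statistical adjustment"
    return "equivalent without adjustment"


# Hypothetical pretest scores for the analytic sample.
treatment_pretest = [52, 55, 60, 48, 50]
comparison_pretest = [50, 54, 58, 47, 51]

difference = baseline_effect_size(treatment_pretest, comparison_pretest)
category = equivalence_category(difference)
```

In this hypothetical sample the baseline difference falls between 0.05 and 0.25, so the pretest would need to be included as a covariate in the impact analysis.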


Q17: How should I make statistical adjustments when effect size differences in baseline 
characteristics are between .05 and .25 to meet WWC standards with reservations? Is it okay to 
include statistical adjustments for baseline variables in the impact analyses when the treatment 
and comparison groups are shown to be equivalent at baseline? 

Answer: Typically, making statistical adjustments for effect size differences between .05 and .25 
involves estimating impacts by using a regression model that includes the key baseline variables 
as covariates. The WWC standards do not require statistical adjustments when the baseline effect 
size differences are .05 or less, but such adjustments can be made to potentially improve the 
precision of estimates. 
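
As a minimal sketch of the regression adjustment described above, the Python below estimates the treatment coefficient from a model of the posttest on a treatment indicator and a single baseline pretest. With one covariate, that coefficient equals the raw difference in posttest means minus the pooled within-group slope times the difference in pretest means. The function name and data are hypothetical. A real analysis would normally use a statistics package, include all required baseline variables, and account for clustering where relevant.

```python
from statistics import mean


def adjusted_impact(pre_t, post_t, pre_c, post_c):
    """Treatment coefficient from regressing posttest on treatment status and pretest."""

    def centered_sums(pre, post):
        # Sums of cross-products and squares around the group's own means.
        mean_pre, mean_post = mean(pre), mean(post)
        s_xy = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
        s_xx = sum((x - mean_pre) ** 2 for x in pre)
        return s_xy, s_xx

    s_xy_t, s_xx_t = centered_sums(pre_t, post_t)
    s_xy_c, s_xx_c = centered_sums(pre_c, post_c)
    # Pooled within-group slope of posttest on pretest.
    slope = (s_xy_t + s_xy_c) / (s_xx_t + s_xx_c)
    raw_difference = mean(post_t) - mean(post_c)
    baseline_difference = mean(pre_t) - mean(pre_c)
    return raw_difference - slope * baseline_difference


# Hypothetical data: the treatment group starts 2.5 points ahead at baseline.
pre_treatment = [50, 55, 60, 65]
post_treatment = [65, 70, 75, 80]
pre_comparison = [48, 52, 58, 62]
post_comparison = [58, 62, 68, 72]

impact = adjusted_impact(pre_treatment, post_treatment, pre_comparison, post_comparison)
```

Here the unadjusted posttest difference is 7.5 points, but 2.5 points of that gap already existed at baseline, so the adjusted estimate is smaller.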


8 WWC experts set these thresholds based on Ho, Imai, King, & Stuart (2007). 
Q18: May survey data be used to demonstrate equivalence to meet WWC standards with 
reservations? 

Answer: Yes. The WWC standards allow data from any source to be used for determining baseline 
equivalence, including surveys, test scores, reliable and valid observations, or administrative 
records. 


Q19: Can a study establish equivalence by using a pretest that is different from the outcome 
measure to meet WWC standards with reservations? 

Answer: In general, a study can still meet WWC standards with reservations when 
demonstrating baseline equivalence on a similar outcome measure, provided that this measure 
has appropriate validity and reliability characteristics and falls within the same outcome 
“domain.” WWC topic area protocols specify these outcome domains. For example, in reviews 
of postsecondary interventions, the “credit accumulation” domain includes outcomes such as 
number of credits earned, ratio of credits earned to credits attempted, or persistence measures 
such as number of continuous semesters enrolled. In the science topic area, math pretest scores 
are acceptable for demonstrating equivalence if a science pretest is not available. Researchers 
should review specific WWC topic area protocols to learn more about how domains are defined 
and also topic- and domain-specific equivalence requirements. 


Q20: In cases where propensity score matching is used to construct a comparison group, can 
equivalence of the propensity scores be used as evidence of equivalent groups pre-intervention to 
meet WWC standards? 

Answer: No. Although researchers may choose to use baseline outcomes and other demographic 
characteristics to calculate the propensity scores, equivalence of the propensity score alone is not 
sufficient for determining baseline equivalence according to WWC standards. Baseline 
equivalence for the analytic sample (the sample used to measure program impacts) must be 
demonstrated for each of the required baseline characteristics. (Note, however, that including the 
variables for which baseline equivalence is required in the matching process will make it more 
likely that equivalence on the required variables will be achieved.) 


Q21: What if I have only test scores from a prior school year as a baseline measure? May these 
be used as a measure of baseline equivalence to meet WWC standards with reservations? 

Answer: Yes. Test scores from prior school years may be used to demonstrate equivalence. 
Because achievement can change over time (for example, achievement levels may increase or 
decrease over the summer, depending on a youth’s summer experiences), measuring achievement 
immediately before the start of the treatment will provide the most accurate depiction of each 
study participant’s starting point and is the preferred approach. However, demonstrating 
equivalence on measures from a previous year when immediate measures are not available is 
suitable. 
Q22: I am able to collect pretest information only after groups are formed and the intervention 
has begun. Can I use this as a measure of baseline equivalence and still meet WWC standards 
with reservations? 

Answer: Yes. Pretest scores obtained after groups are formed and the intervention has begun can be 
used to demonstrate baseline equivalence. However, authors should be cautious in collecting 
these data too long after the intervention has begun because, assuming that the intervention will 
have an effect on test scores, it is possible that scores may already start to diverge by the time 
students take the pretest. If the analytic sample includes groups that are not equivalent when 
pretest scores are obtained, then the study would not meet WWC standards, even if careful 
matching procedures had been implemented before the intervention began. Also, the difference 
between the pretest and posttest will no longer measure the full impact of the program, which 
may reduce the chance that the study will detect significant differences in outcomes. 


Q23: Is demonstrating equivalence needed for subgroup analyses? 

Answer: Yes, it is important to demonstrate baseline equivalence for any impact analyses 
conducted, including full sample and subgroup analyses. The WWC reviews equivalence 
information for the subgroups in order to determine whether subgroup impact analyses meet 
WWC standards. 


Q24: I am using gain scores from pretest to posttest as my outcome for both the treatment and 
comparison groups. Do I also need to adjust for differences using covariates to meet WWC 
standards with reservations? 

Answer: Yes. If pretest differences in a QED fall between 0.05 and 0.25 standard deviations, the 
study must include the pretest as a covariate in a statistical model (such as ANCOVA or 
regression analysis), regardless of whether the study uses gain scores. Just using a gain score 
does not provide sufficient statistical control for the baseline measure, because it does not 
account for the correlation between the outcome and the baseline measure. When a QED study 
uses gain scores as outcomes, the WWC typically will request the unadjusted posttest means and 
standard deviations from the authors. These allow the WWC to calculate effect sizes (Hedges’ g) 
that are comparable to its standard effect size. Effect sizes based on gain scores are not 
comparable because the standard deviation of a gain score is typically smaller than the standard 
deviation of the posttest. 
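
The effect of the choice of standard deviation can be illustrated with a short sketch. The brief does not spell out the Hedges’ g formula, so the code below assumes the commonly used form: the difference in means divided by the pooled standard deviation, multiplied by the approximate small-sample correction 1 - 3/(4N - 9), where N is the total sample size. The function name and all numbers are hypothetical.

```python
from math import sqrt


def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference with an approximate small-sample correction."""
    pooled_sd = sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                     / (n_t + n_c - 2))
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)  # assumed form of the correction
    return correction * (mean_t - mean_c) / pooled_sd


# Hypothetical study: a 4-point difference between groups, 50 students per group.
# The posttest has a standard deviation of 10; the gain score has one of 5.
g_posttest_sd = hedges_g(54, 10, 50, 50, 10, 50)
g_gain_sd = hedges_g(54, 5, 50, 50, 5, 50)
```

The same 4-point difference looks twice as large when it is divided by the smaller gain-score standard deviation, which is why posttest standard deviations are needed for comparable effect sizes.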


Sample Loss and Power 


Q25: Do I need to worry about attrition (sample loss) in my QED to meet WWC standards with 
reservations? 

Answer: Not for the WWC study rating, because equivalence between the treatment and 
comparison groups is calculated using the analytic sample (that is, after any attrition may have 
occurred from the beginning of the study). 9 Although the WWC does not factor attrition into the 
rating of QEDs, attrition could pose a challenge for a QED study. In studies where the treatment 
and comparison groups are carefully matched at baseline and followed over time, sample loss 
could destroy the original equivalence of the treatment and comparison groups. Also, substantial 
sample loss will reduce the analytic sample size, making it harder to detect statistically 
significant differences. For these reasons, researchers should always design studies that use 
procedures that will maximize data collection response and minimize sample loss. 


Q26: Is statistical power a factor in meeting WWC standards with reservations? 

Answer: No. Statistical power is not a factor in whether a study meets standards with 
reservations, but it does influence how the WWC characterizes results. A study finding that has 
an effect size smaller than 0.25 and is not statistically significant will be considered by the 
WWC as an “indeterminate” effect. When designing their studies, researchers should consider 
what effect size can reasonably be expected and should determine the associated sample size 
requirements to ensure that their study is adequately powered to detect effect sizes of that 
magnitude as statistically significant. 
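
A rough version of the sample size calculation described above can be sketched as follows. The sketch uses the standard normal-approximation formula for comparing two group means and assumes individual-level (non-clustered) data, equal group sizes, a two-sided test, and no covariates. The function name and default values are illustrative assumptions; clustered designs and designs with strong baseline covariates need a more complete power analysis.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a standardized effect."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


# Halving the expected effect size roughly quadruples the required sample.
needed_for_half_sd = n_per_group(0.50)
needed_for_quarter_sd = n_per_group(0.25)
```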


Q27: Can sample sizes be reduced as part of the matching process and still meet WWC standards 
with reservations? 

Answer: Yes. A subsample of participants can be matched and included in the analysis to ensure 
that groups are equivalent on observed pre-intervention characteristics. Researchers should be 
consistent and transparent about their methods of sample selection and matching. If the method 
for achieving equivalence in the sample compromises the integrity of the study (for example, if 
there is evidence that participants were inconsistently excluded from the research sample across 
the treatment and comparison conditions on the basis of specific characteristics of the 
participants), a study is unlikely to meet WWC standards. 


Q28: Can I impute missing outcome values in a QED and meet WWC standards with 
reservations? 

Answer: No. Only the results from the unimputed analysis will be considered when the WWC 
reviews the study against WWC standards. The WWC may request unimputed information from 
authors if needed. The reason for this is that when researchers impute outcomes, they substitute 
missing values with a best-guess estimate of what the value would have been if the information 
had been available. Researchers use a variety of methods to impute missing values. However, for 
QEDs (and RCTs with substantial sample loss), the WWC asks for the unimputed data because 
of a concern that not enough information is known about the missing sample members to be able 
to estimate what an outcome or covariate would have been if it had not been missing. For 
example, in a study of a postsecondary mentoring program where youths volunteered to be part 


9 The WWC standards include levels of acceptable attrition rates for RCTs but not for QEDs. This is because RCTs 
that have high attrition run the risk of no longer having equivalent treatment and control groups. For this reason, 
high attrition RCTs cannot meet WWC standards without reservations. Like QEDs, they must demonstrate that their 
analytic samples are equivalent at baseline to be eligible for a rating of “meets WWC standards with reservations.” 
of the program and are compared to those who chose not to participate, treatment group members 
who do not complete a follow-up survey may be those who did not engage with the program; 
whereas missing comparison group members may be a more diverse set of students who weren’t 
available for follow-up and feel less connected to the study. In this case, it would be difficult to 
make assumptions about what the outcomes would have been if the researchers had been able to 
obtain the outcomes for everyone, and inaccurate assumptions made during imputation could 
inappropriately change the estimated magnitude of the program effect (or, in other words, the 
estimated effects could be “biased”). The What Works Clearinghouse Procedures and Standards 
Handbook, Version 3.0, discusses the use of actual versus imputed measures and describes 
acceptable imputation methods on pages 18-19. 


Q29: Can missing baseline data that are needed to demonstrate group equivalence be imputed? 

Answer: No. A QED must demonstrate equivalence of the groups in the analytic sample by 
using actual observed data to meet WWC standards with reservations. This is because using 
imputed data to assess baseline equivalence could bias a study toward demonstrating 
equivalence. For example, this bias could occur when imputed results are based on very little 
information, resulting in similar imputed values for the treatment and comparison groups. 


Analytic Techniques 

Q30: What other aspects of study design are important considerations when the WWC reviews 
studies against standards? 

Answer: When a study does not take certain design features into account in the analysis, the 
WWC will either request information from the study authors in order to re-analyze the data or 
will re-analyze the data by using defaults. 10 These adjustments can result in changing a 
study-reported statistically significant finding to one that is not statistically significant. Researchers 
planning a study should consider these issues at the design stage to ensure consistency of 
interpretation. The following specific design shortcomings can lead to WWC adjustment or 
reinterpretation of findings: 

• A study has a clustered design but does not account for clustering in the analysis. For 
example, if the treatment occurs at the classroom or school level, the study must statistically 
adjust for the clustered structure of the data. Researchers should design and analyze studies 
that appropriately adjust for clustering to obtain more precise estimates than the WWC would 
be able to calculate on the basis of simple defaults. 

• A study does not adjust for testing multiple comparisons within the same outcome 
domain. The more outcomes in an analysis that measure a similar concept, the higher the 
probability that researchers may find at least one statistically significant effect by chance. 
The WWC uses the Benjamini-Hochberg procedure when multiple comparisons arise. 
Researchers should conduct multiple comparison adjustments when they are analyzing 


10 Adjustments to statistical significance do not affect whether a study meets WWC standards but will affect how the 
WWC characterizes the findings. 
impacts for multiple similar outcomes. 11 This issue also has bearing on study design decisions 
and sample size requirements. Researchers who limit the number of outcomes measured 
within a domain and focus on the strongest and most relevant measures will have a greater 
ability to detect statistically significant program effects than if they include a wide range of 
outcomes. The more outcomes included, the larger the sample that will be necessary to detect 
statistically significant effects. Study designers should factor in the number of outcomes 
within a domain (in addition to other factors necessary to do an appropriate statistical power 
analysis) when assessing the sample size requirements for a study. 
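
The multiple comparison adjustment mentioned above can be illustrated with the textbook form of the Benjamini-Hochberg procedure. This is a sketch of the standard step-up rule, not a reproduction of the WWC's own calculations, which are documented in its handbook; the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one True/False flag per p-value: significant after the BH adjustment."""
    count = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(count), key=lambda index: p_values[index])
    # Find the largest rank whose p-value is at or below its threshold.
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / count:
            largest_rank = rank
    # Every finding ranked at or below that rank is treated as significant.
    significant = [False] * count
    for rank, index in enumerate(order, start=1):
        if rank <= largest_rank:
            significant[index] = True
    return significant


# Hypothetical p-values for four outcomes in the same domain.
flags = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

In this hypothetical example three of the four outcomes have unadjusted p-values below 0.05, but only the first remains statistically significant after the adjustment, which shows why adding outcomes within a domain raises the bar for each one.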


Q31: Does a QED need to incorporate into its analysis how the intervention was actually 
implemented? 

Answer: The WWC does not have standards that take into account or require adjustments based 
on how an intervention was implemented in a specific study. However, information about how 
the intervention was implemented, such as measures of dosage or implementation fidelity, might 
provide important contextual information for understanding the implications of a study. For that 
reason, it is a good idea to collect data on implementation and describe implementation when 
presenting the results of impact analyses. 


Common Pitfalls: Reasons Why QED Findings Fail to Meet WWC 
Standards with Reservations 12 

Knowing what problems researchers encounter when conducting QED studies can help future 
researchers avoid these problems. The following pitfalls are some of the most common ones that 
keep a QED study from meeting WWC standards: 

• An inability to attribute measures of effectiveness solely to the intervention, or, in other 
words, the study has a “confounding” factor. As described in the “WWC Standards” section 
and in Appendix A, a confounding factor occurs when a characteristic is completely aligned 
with either the treatment or comparison condition. The most common confounds occur when: 

■ A study compares outcomes for a cohort of participants in one year to a cohort in an 
earlier year. 

■ A treatment or comparison condition is clustered within one classroom, school, or 
district. For example, all students assigned to the treatment have the same teacher. 

■ The study uses different methods to collect data for the treatment and comparison groups. 
For example, researchers collect survey data from the treatment group and administrative 
records for the comparison group. 


11 The What Works Clearinghouse Procedures and Standards Handbook, Version 3.0, provides detailed guidance on 
acceptable methods for adjusting for multiple comparisons. 

12 A webinar presented by the WWC on March 3, 2015, also discusses common pitfalls that cause a QED study not 
to meet WWC group design standards. Issues most relevant to RCT studies are discussed in a July 21, 2014, webinar 
conducted by the WWC. 
• The outcomes do not meet the requirements indicated in the WWC standards. Most 
commonly, this is because 

■ The outcome is directly aligned with the content of the intervention (for example, a 
reading assessment that includes comprehension passages to which the treatment group 
students had already been exposed). 

■ Either reliability statistics for outcomes (for example, inter-rater reliability or internal 
consistency) fall below the standards in the WWC topic area protocol, or reliability 
information is not provided. 

• Equivalence of the analytic sample on pretest or other WWC protocol-specified 
demographic measures is not appropriately demonstrated. This commonly occurs when 

■ The authors do not collect baseline data to determine equivalence. 

■ The study does not present equivalence information for the analytic sample. This often 
occurs when researchers provide equivalence information for the original baseline sample 
and not the sample that is being used in the analysis. In other cases, researchers collect 
pre-intervention data but do not assess or report whether groups are equivalent. 
Researchers who develop data collection and analysis procedures that ensure collection 
of appropriate pre-test and demographic measures and who focus equivalence 
assessments on the analytic sample will have the capacity to demonstrate that the 
treatment and comparison groups are similar, thus making a QED eligible to meet WWC 
standards with reservations. 

■ The analytic sample is not equivalent on the required measures according to WWC 
standards for demonstrating equivalence (see Appendix A for more details). This could 
occur even if groups are equivalent at the beginning of a study, because sample makeup 
can change over time. (Note that a study could meet WWC standards with reservations in 
some outcome domains and not in others if the topic area protocol specifies that 
equivalence needs to be demonstrated only within an outcome domain such as reading, 
math, behavior, and so forth). 

• The study uses analysis methods that do not meet WWC standards. This often happens 
when 

■ The study needs to adjust for baseline differences but does not include appropriate 
covariates in the analysis. 

■ The QED study uses imputation methods to fill in missing values for outcome measures. 
Appendix A. Checklist for Quasi-Experimental Designs; Study 
Design Characteristics to Consider 


The checklist and accompanying table in this appendix highlight key issues that researchers 
should consider when designing strong QED studies. The checklist is broken into two sections. 
The first section focuses on design issues that could influence whether a study can meet WWC 
standards with reservations. The second section covers other general design issues that 
researchers should factor in at the planning stage. Each item in the checklist is explained in 
further detail in Table A.1. 


Checklist for QEDs during the Study Design Phase 

Is my QED study designed to meet WWC standards with reservations? 

□ The study will compare two distinct groups: a treatment group and a comparison group. 

□ The comparison group will be drawn from a population similar to that of the treatment 
group, and groups will be equivalent on observable pre-intervention characteristics. 

□ The contrast between the treatment and comparison groups will measure the impact of 
the treatment that I am interested in. 

□ There will be no known confounding factors. 

□ The study will collect pre-intervention measures of the primary outcomes of interest as 
well as background characteristics at baseline. 

□ The study will collect valid and reliable outcome data that are most relevant to assess 
intervention effects. 

□ The data collection process will be the same (same instruments, same time, same 
year) for the treatment and comparison groups. 

Is my study designed with additional qualities of a strong QED? 

□ The study has pre-specified and clear primary and secondary research questions. 

□ The study results will generalize to a policy or program-relevant population. 

□ The study has established clear criteria for research sample eligibility and matching 
methods. 

□ The study will have an analytic sample size large enough to detect meaningful and 
statistically significant differences between the treatment and comparison groups. 

□ The study is designed to detect meaningful and statistically significant effects for 
specific subgroups of interest if this is a high priority for my study. 
Table A.1. Study Design Characteristics to Consider When Planning a Strong QED 


Study design characteristic | This is important because . . . | How does this issue relate to the WWC standards? | General considerations 

A. The study will compare two distinct groups: a treatment group and a comparison group. 

This is important because . . . 

To measure the effect of a program or practice, a treatment group that receives the intervention 
must be compared to a separate comparison group that has not received this intervention. When 
these groups are not distinct (for example, the same group of students before and after a 
treatment), then it is impossible to isolate the effect of the intervention (for example, regular 
maturation could explain changes in outcomes over time). 

How does this issue relate to the WWC standards? 

To be eligible to meet WWC standards, a study must have at least two distinct groups that are 
compared (sometimes there are more than two groups if multiple interventions are being tested or 
if there are multiple comparison groups). The standards do not specify criteria for how 
researchers form these groups in QEDs. Retrospective data based on extant (already collected) 
data and prospective nonrandomized design studies that rely on new data can both be used to form 
the groups. Despite the fact that the WWC standards do not have restrictions about how groups 
can be formed, choosing the right groups can have major implications for what will be tested and 
for whether the study can meet WWC standards with reservations (see items B through G in the 
“Study design characteristic” column of this table). 

General considerations 

At the study design stage, researchers should confirm that groups are distinct. Researchers must 
weigh issues related to cost, convenience, and timing when determining which groups will be 
included in the study. For more issues to consider when determining how to form treatment and 
comparison groups, see items B through G in the “Study design characteristic” column of this 
table. 


DIR, Inc. 


Designing and Conducting Strong Quasi-Experiments in Education 


19 



B. The comparison group will be drawn from a population similar to that of the treatment group, 
and groups will be equivalent on observable pre-intervention characteristics. 

This is important because . . . 

A comparison group that is drawn from a similar population has a stronger chance of serving as a 
proxy for what a treatment group would have experienced if it had not been exposed to the 
intervention. When the comparison group is drawn from a different population or setting, 
differences in outcomes between the treatment and comparison groups may be related to the 
characteristics of different settings rather than to the effect of the intervention. For example, 
it may not be possible to attribute differences in outcomes to a treatment for high-needs 
students if all of the treatment group students attend schools that serve predominantly urban, 
high-needs students and comparison group students attend schools that serve a more diverse set 
of suburban students. 

How does this issue relate to the WWC standards? 

If the treatment and comparison groups are drawn from different populations or settings, the 
study may not have an adequate comparison condition and so may not meet WWC standards. 

General considerations 

In a sound QED study, the comparison group should serve as a “mirror” to the treatment group. 
Researchers should analyze data at the study design stage to assess whether potential groups may 
be drawn from different populations. If so, then they should either (1) determine whether a 
different population can serve as a comparison group or (2) use careful matching techniques, 
such as propensity score matching or direct matching, on key characteristics that are highly 
related to desired outcomes. These efforts will help ensure that groups will be equivalent on 
observable characteristics. While it is not possible to match on unobserved characteristics, it 
is possible to use observable characteristics to match groups. 

C. The contrast between the treatment and comparison groups will measure the impact of the 
treatment that I am interested in. 

This is important because . . . 

The contrast between the experiences of the treatment and comparison groups influences the 
interpretation of the program impacts. The strongest contrast occurs when a fully implemented 
intervention experience is compared to either no alternate intervention or a “status quo” 
educational experience (like the established curriculum). The contrast can be minimized if the 
comparison group receives a new or existing alternative treatment that is similar to the 
intervention or if there are low rates of program participation. 

How does this issue relate to the WWC standards? 

In general, the nature of the contrast between the treatment and comparison groups would not 
exclude a QED from meeting WWC standards with reservations. However, it could affect whether and 
how the study could be aggregated with others evaluating a similar intervention (e.g., in a 
later WWC intervention report). 

General considerations 

Researchers should think carefully about what the contrast between the treatment and comparison 
groups will likely be. This contrast has implications for sample selection (that is, choosing a 
comparison group that is not participating in a similar intervention). This consideration also 
highlights the importance of planning to measure participation rates and program implementation 
to learn more about the experiences of both the treatment and comparison groups. 



D. There will be no known confounding factors. 

This is important because . . . 

Researchers can more confidently attribute effects to an intervention when the only explanation 
for differences is that the treatment group received a program and the comparison group didn’t. 
When another characteristic (a “confounding factor”) that is unrelated to the intervention is 
present in either the treatment or comparison condition but not both, it is no longer possible 
to say with confidence that differences are due to the treatment. Differences could be due to 
that other characteristic. This can occur when there is only one “unit” in one or both 
conditions. For example, if there is one treatment teacher and two comparison teachers, and the 
treatment teacher is highly motivated and engaging, then which is having an effect: the 
treatment or the attributes of the treatment teacher? A similar situation can occur, for 
example, when a study compares students from one academic year to a prior academic year. What 
accounts for the differences: the program or other things that occurred during these academic 
years (like changes in leadership, staffing, or alternate curricular offerings)? 

How does this issue relate to the WWC standards? 

Any study that has a confounding factor in which there is a known characteristic that is 
completely aligned with the treatment or comparison condition will not meet WWC standards. 

One exception is if a treatment is bundled with another intervention. A QED study of this type 
could meet WWC standards with reservations but it may not be able to be aggregated with others 
evaluating a similar intervention (e.g., in a WWC intervention report). 

General considerations 

Although it is not always possible to plan in advance for all contingencies that may arise 
during a study, researchers should carefully consider potential confounds during the sample 
selection process. Any potential confounds that arise during the course of a study should be 
documented carefully to help inform interpretation of study findings. For example, a QED study 
may have initially been designed to compare a new science supplemental program to no supplement. 
However, during the course of the study, all of the school principals in the treatment group 
schools decided jointly to also use a new science curriculum while the comparison group 
continued with its existing science curriculum. In this example, the study would no longer be 
able to isolate the effects of the science supplement alone, and researchers would need to be 
clear that they are now testing the effects of a combination of a new science curriculum and 
supplement. 

E. The study will collect pre-intervention measures of the primary outcomes of interest as well 
as background characteristics at baseline. 

This is important because . . . 

QED studies that collect pre-intervention measures can help provide evidence that the treatment 
and comparison groups were similar before program implementation, thus making evidence of 
program effectiveness more plausible. Because participants are not selected at random in QEDs, 
the treatment and comparison groups may differ in ways that we can observe as well as ways that 
we cannot. These initial differences, if they are related to program outcomes, could bias 
estimates of program effects. 

How does this issue relate to the WWC standards? 

All QED studies have to have baseline equivalence for their analytic sample in order to meet WWC 
standards with reservations, and therefore must collect appropriate pre-intervention measures. 

General considerations 

Researchers can use pre-intervention data to help formulate well-matched groups, to assess 
whether groups are matched, and to analyze and statistically control for pre-intervention 
differences in outcomes and other background characteristics. 

F. The study will 
collect valid and 
reliable outcome data 
that are most relevant 
to assess intervention 
effects. 

Studies with strong outcome measures
will provide the most useful evidence of
program effectiveness. The most useful
outcomes are not overly aligned with
the intervention being tested, are
general enough to be policy relevant,
are replicable in other studies, and are
specific enough that researchers would
expect the intervention to affect them.

In general, QED study findings for
outcomes that lack validity (i.e., don’t
measure what they are supposed to
measure), lack reliability (i.e., aren’t
measured consistently), or are overaligned
(i.e., measure content that is covered
explicitly in the intervention but not in the
comparison condition) will not meet WWC
standards.

Researchers should carefully select 
outcomes that have strong psychometric 
properties and are most relevant to 
measuring program effectiveness. 
Whenever possible, researchers should 
try to use strong pre-existing measures. 
Researchers can access many resources 
to see the wide array of outcomes 
currently available. When it is not
possible to use existing measures,
researchers should carefully design
outcomes that are not overly aligned with
the intervention, and they should
document the development process and
psychometric properties of those
outcomes.

G. The data collection 
process will be the 
same — same 
instruments, same 
time, same year — for 
the treatment and 
comparison groups. 

When data are collected in a similar 
fashion, researchers can be more 
confident that differences in outcomes 
between the treatment and comparison 
groups are not due to the method of 
data collection. Differences in data 
collection can occur when, for example, 
data for the treatment group come 
directly from a student or teacher 
survey, but data for the comparison 
group come from administrative 
records. The timing of data collection
also needs to be the same for both
groups.

Differences in data collection procedures, if 
completely aligned with the treatment or 
comparison conditions, are a “confound” 
(see “C” in the “Study design characteristic” 
column), and would not meet WWC 
standards. For example, this confound
could occur if all treatment group members
are surveyed and data for all comparison
group members come from administrative
records.

During the design phase, researchers 
should plan data collection procedures to 
ensure that no confound is related to data 
collection. In addition, careful preparation 
and a clear data collection process can 
help to improve the quality of data 
collected and reduce sample attrition. In 
particular, if the study is a prospective 
study, researchers should make every 
effort to reduce both overall sample loss 
and differential sample loss between the 
treatment and comparison conditions 
(which could lead to the analytic sample 
no longer being equivalent, even if careful 
matching had occurred at the beginning 
of the study). 
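
One way to monitor the sample loss discussed above is to track overall and differential attrition as data come in. The sketch below is a minimal illustration with hypothetical function and variable names and made-up counts. It reports rates only and does not apply any WWC threshold.

```python
def attrition_rates(start_t, end_t, start_c, end_c):
    """Return overall, group-level, and differential attrition as proportions.

    start_* are the numbers of participants at baseline; end_* are the
    numbers remaining in the analytic sample.
    """
    attr_t = 1 - end_t / start_t
    attr_c = 1 - end_c / start_c
    overall = 1 - (end_t + end_c) / (start_t + start_c)
    return {
        "overall": overall,
        "treatment": attr_t,
        "comparison": attr_c,
        # Differential attrition is the gap between the two group rates.
        "differential": abs(attr_t - attr_c),
    }


# Hypothetical counts: 200 students per group at baseline.
rates = attrition_rates(start_t=200, end_t=170, start_c=200, end_c=190)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

Because sample loss can undo careful matching, a rise in either rate is a signal to recheck baseline equivalence on the participants who remain.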

H. The study has pre- 
specified and clear 
primary and 
secondary research 
questions. 

A carefully planned study that has 
specified primary and secondary 
research questions is more credible to 
its audience because it shows that 
researchers were not going on a “fishing 
expedition” to find significant results. It 
also helps to focus analyses on the 
most critical and relevant outcomes. 

This is good practice, but the WWC
standards do not address the quality of the 
research questions or the division of 
outcomes into primary or secondary. 

Researchers should take the time at the 
beginning of designing a study to 
consider the most critical research 
questions and should use these 
questions to frame other design issues, 
such as sampling, matching techniques, 
outcome selection, and analysis and 
reporting plans. 

I. The study results
will generalize to a 
policy or program- 
relevant population. 

Even if a sound study is designed in
which similar groups are being
compared and the contrast is clear, if
the study sample is not representative
of a relevant population, the results of
the study will be of limited use to
policymakers and practitioners.

The WWC standards focus on internal 
validity, which is related to how confident 
we are that the findings are an accurate 
depiction of an intervention’s effectiveness, 
and do not address the composition of 
study populations or generalizability of the 
findings to broader populations. 

Researchers should choose a study 
population that is most relevant to 
answering questions (1) about program
effectiveness for a policy- and
practitioner-relevant population and,
when applicable,
(2) about whether the population is 
relevant to the particular grant program to 
which they are applying. If a convenience 
sample is the only sample available, 
researchers should think carefully about 
whether results from the study will be 
useful and relevant. 

J. The study has 
established clear 
criteria for research 
sample eligibility and 
matching methods. 

Having clear eligibility criteria enables 
researchers to focus recruitment and 
sample selection on the most relevant 
population. It also helps researchers 
formulate matching strategies to ensure 
the equivalence of treatment and 
comparison groups. 

The WWC standards do not address 
specific requirements regarding sample 
eligibility or matching techniques. 

Researchers who conduct well- 
implemented QED studies consider 
issues related to sample recruitment and 
participant eligibility early in the study 
design process. Appropriate matching
procedures should be determined on the
basis of (1) each study’s particular
situation and (2) key issues related to
the availability of baseline data, the
available sample size, and the baseline
characteristics that are most related to
the outcomes of interest.

K. The study will have 
an analytic sample 
size that is large 
enough to detect 
meaningful and 
statistically significant 
differences between 
the treatment and 
comparison groups. 

A study with adequate statistical power 
will have a large enough sample size to 
detect expected statistically significant 
effects. This reduces the risk of
incorrectly concluding that a program
doesn’t affect outcomes when it
actually does.

The WWC standards do not address 
statistical power, but taking it into account 
is good practice. 

Statistical power should be carefully 
analyzed in the study design phase to 
ensure that there is an adequate sample 
to detect the expected differences
between the treatment and comparison
conditions. Researchers should carefully
consider what the expected effect may be
and how much covariates may reduce
unexplained variation in outcomes. Also,
researchers should analyze the power
ramifications of clustering in the data
if the intervention will be
provided at the cluster (e.g., classroom,
school) level.
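
The power considerations above can be made concrete with a rough minimum detectable effect size (MDES) calculation. The sketch below is a simplified approximation, not a substitute for dedicated power analysis software. It assumes a normal approximation, equal cluster sizes, and a single R-squared value for the covariates. The function name and all parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_treat, n_comp, cluster_size=1, icc=0.0, r_squared=0.0,
         alpha=0.05, power=0.80):
    """Approximate minimum detectable effect size in standard deviation units."""
    z = NormalDist()
    # Multiplier for a two-sided test at the chosen alpha and power.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Design effect: clustering shrinks the effective sample size.
    design_effect = 1 + (cluster_size - 1) * icc
    eff_t = n_treat / design_effect
    eff_c = n_comp / design_effect
    # Covariates that explain outcome variance reduce the MDES.
    return multiplier * sqrt((1 - r_squared) * (1 / eff_t + 1 / eff_c))


# 200 students per group, ignoring clustering.
simple = mdes(200, 200)
# Same students, delivered in classrooms of 20 with an assumed ICC of 0.15.
clustered = mdes(200, 200, cluster_size=20, icc=0.15)
# Adding a pretest assumed to explain half of the outcome variance.
adjusted = mdes(200, 200, cluster_size=20, icc=0.15, r_squared=0.5)
print(round(simple, 2), round(clustered, 2), round(adjusted, 2))
```

In this illustration the same 400 students support a much larger minimum detectable effect once classroom-level clustering is taken into account, and a strong pretest recovers part of that loss.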

L. The study is 
designed to detect 
meaningful and 
statistically significant 
effects for specific 
subgroups of interest 
if this is a high priority 
for my study. 

If researchers are specifically interested 
in knowing whether a program works for 
specific subgroups, then a study should 
establish baseline equivalence of the 
subgroup treatment and comparison 
groups and be designed with a large 
enough sample within these subgroups 
to detect expected differences for these 
subgroups. 

The WWC standards do not address
statistical power, but taking it into account is
good practice. 

Researchers should determine, in 
advance, whether there are specific high- 
priority subgroups of interest and should 
design a study with enough of a sample 
to be able to detect expected effects. 
They may also review relevant WWC 
topic area protocols to see whether the 
WWC would report these findings as 
supplemental evidence of intervention 
effectiveness. 

M. The planned 
analysis methods will 
appropriately reflect 
the research design 
and sample selection 
procedures. 

Well-designed analysis plans improve a 
researcher's ability to report credible 
estimates of program effects. Analyses 
that are not well designed and 
implemented run the risk of yielding 
imprecise estimates of program effects 
that researchers and policymakers may
not consider useful. 

Certain analysis methods affect whether a 
study meets WWC standards. QED studies 
that require statistical adjustment for 
baseline differences (see the equivalence 
discussion on page 4 and in “Study design 
characteristic B” in this table) will not meet 
WWC standards if appropriate covariates 
are not included in the analyses by using 
methods such as regression or ANCOVA 
(gain score, ANOVA, or difference-in- 
difference analyses would not be 
acceptable). QED studies that impute 
baseline or outcome data also will be rated 
as not meeting standards by the WWC. 

Researchers should carefully plan their 
analyses in advance, including 
determining the best statistical model that 
fits the research design and sampling 
methods, determining the primary and 
secondary outcomes and planned 
adjustments, and planning appropriate 
sensitivity analyses to see how results 
vary, depending on assumptions made. 
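
To illustrate the kind of covariate adjustment described in this row, the sketch below computes an ANCOVA-style treatment effect with a single baseline covariate and a common slope for both groups. It is a teaching sketch with hypothetical names and made-up data. An actual analysis would normally use a statistical package, include all required covariates, and report standard errors that account for any clustering.

```python
from statistics import mean


def ancova_effect(pre_t, post_t, pre_c, post_c):
    """Return (unadjusted, adjusted, slope) for a one-covariate ANCOVA."""

    def within_sums(pre, post):
        # Sums of cross-products and squares around the group's own means.
        m_pre, m_post = mean(pre), mean(post)
        sxy = sum((x - m_pre) * (y - m_post) for x, y in zip(pre, post))
        sxx = sum((x - m_pre) ** 2 for x in pre)
        return sxy, sxx

    sxy_t, sxx_t = within_sums(pre_t, post_t)
    sxy_c, sxx_c = within_sums(pre_c, post_c)
    # Pooled within-group slope of the outcome on the baseline measure.
    slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)
    unadjusted = mean(post_t) - mean(post_c)
    # Remove the part of the outcome gap predicted by the baseline gap.
    adjusted = unadjusted - slope * (mean(pre_t) - mean(pre_c))
    return unadjusted, adjusted, slope


# Hypothetical data: the treatment group starts one point higher at baseline.
unadjusted, adjusted, slope = ancova_effect(
    pre_t=[2, 3, 4], post_t=[7, 9, 11],
    pre_c=[1, 2, 3], post_c=[2, 4, 6],
)
print(unadjusted, adjusted, slope)
```

Here the raw outcome gap overstates the effect because the treatment group started ahead. The adjustment subtracts the portion of the gap that the baseline difference predicts.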

N. The study includes 
a clear plan to 
document the 
implementation 
experiences of the 
treatment and 
comparison 
conditions. 

Careful documentation of program and
comparison experiences provides
invaluable evidence to help understand
why a study did or did not find
significant results. It helps to document
the contrast between the treatment and 
comparison conditions, whether the 
program was implemented as intended, 
and the degree to which research 
subjects participated in the program. 

The WWC standards do not address
implementation issues, but it is good
practice to take this into account. The
WWC does narratively describe program 
implementation when reporting on studies 
that meet WWC standards. 

Researchers who are planning a 
prospective study should develop a 
strong implementation analysis plan that 
measures and documents adherence to 
the intervention, the contrast between the 
treatment and comparison conditions, 
and contextual issues specific to the 
study (such as changes in the 
environment or adaptations that were 
made over time). Researchers should 
also consider including in their study a 
careful assessment of program quality, 
although it might be costly. 